2 research outputs found

    High-dimensional Sparse Count Data Clustering Using Finite Mixture Models

    Due to the massive amount of available digital data, automating its analysis and modeling for different purposes and applications has become an urgent need. One of the most challenging tasks in machine learning is clustering, which is defined as the process of assigning observations sharing similar characteristics to subgroups. Such a task is significant, especially in implementing complex algorithms to deal with high-dimensional data. Thus, the advancement of computational power in statistical-based approaches is increasingly becoming an interesting and attractive research domain. Among the successful methods, mixture models have been widely acknowledged and successfully applied in numerous fields, as they provide a convenient yet flexible formal setting for unsupervised and semi-supervised learning. An essential problem with these approaches is to develop a probabilistic model that represents the data well by taking into account its nature. Count data are widely used in machine learning and computer vision applications where an object, e.g., a text document or an image, can be represented by a vector corresponding to the appearance frequencies of words or visual words, respectively. Thus, they usually suffer from the well-known curse of dimensionality, as objects are represented with high-dimensional and sparse vectors, i.e., a few thousand dimensions with a sparsity of 95 to 99%, which degrades the performance of clustering algorithms dramatically. Moreover, count data systematically exhibit the burstiness and overdispersion phenomena, neither of which can be handled by a generic multinomial distribution, typically used to model count data, due to its independence assumption.
This thesis is constructed around six related manuscripts, in which we propose several approaches for high-dimensional sparse count data clustering via various mixture models based on hierarchical Bayesian modeling frameworks that have the ability to model the dependency of repetitive word occurrences. In such frameworks, a suitable distribution is used to introduce the prior information into the construction of the statistical model, based on a conjugate distribution to the multinomial, e.g., the Dirichlet, generalized Dirichlet, and Beta-Liouville, which has numerous computational advantages. Thus, we proposed a novel model that we call the Multinomial Scaled Dirichlet (MSD), which uses the scaled Dirichlet as a prior to the multinomial to allow more modeling flexibility. Although these frameworks can model burstiness and overdispersion well, they share similar disadvantages that make their estimation procedures very inefficient when the collection size is large. To handle high dimensionality, we considered two approaches. First, we derived close approximations to the distributions in a hierarchical structure to bring them to the exponential-family form, aiming to combine the flexibility and efficiency of these models with the desirable statistical and computational properties of the exponential family of distributions, including sufficiency, which reduce the complexity and computational effort, especially for sparse and high-dimensional data. Second, we proposed a model-based unsupervised feature selection approach for count data to overcome several issues that may be caused by the high dimensionality of the feature space, such as over-fitting, low efficiency, and poor performance. Furthermore, we handled two significant aspects of mixture-based clustering methods, namely parameter estimation and model selection.
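As background for the Dirichlet-multinomial hierarchy mentioned above, the standard result of integrating the multinomial parameters out against a Dirichlet prior is the Dirichlet compound multinomial. The notation here is an assumption made for illustration (count vector x of dimension D, Dirichlet parameters alpha); it is the textbook Dirichlet case, not the MSD model proposed in the thesis.

```latex
p(\mathbf{x} \mid \boldsymbol{\alpha})
  = \frac{n!}{\prod_{d=1}^{D} x_d!}\,
    \frac{\Gamma(A)}{\Gamma(n + A)}
    \prod_{d=1}^{D} \frac{\Gamma(x_d + \alpha_d)}{\Gamma(\alpha_d)},
\qquad n = \sum_{d=1}^{D} x_d, \quad A = \sum_{d=1}^{D} \alpha_d .
```

When the alpha values are small, a word that has already appeared becomes more likely to appear again, which is how this family can capture burstiness.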
We considered the Expectation-Maximization (EM) algorithm, which is a broadly applicable iterative algorithm for estimating the mixture model parameters, and incorporated several techniques to reduce its dependency on initialization and avoid poor local maxima. For model selection, we investigated different approaches to find the optimal number of components based on the Minimum Message Length (MML) philosophy. The effectiveness of our approaches is evaluated using challenging real-life applications, such as sentiment analysis, hate speech detection on Twitter, topic novelty detection, human interaction recognition in films and TV shows, facial expression recognition, face identification, and age estimation.
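To make the EM step concrete, here is a minimal sketch of EM for a plain multinomial mixture on count vectors. It is a generic baseline under stated assumptions, not the thesis's MSD model or its estimation procedure: the function name `em_multinomial_mixture`, the smoothing constant `smooth`, the random soft initialization, and the toy data are all hypothetical, and no MML-based model selection is included (the number of components `k` is fixed by the caller).

```python
import math
import random


def em_multinomial_mixture(X, k, n_iter=50, smooth=1e-2, seed=0):
    """Fit a k-component multinomial mixture to count vectors X with EM.

    X is a list of equal-length lists of non-negative counts.
    Returns (weights, thetas, resp): mixing weights, per-component
    word probabilities, and per-document responsibilities.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])

    # Assumption: random soft initialization of the responsibilities.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])

    weights, thetas = [], []
    for _ in range(n_iter):
        # M-step: smoothed mixing weights and word probabilities.
        weights = [
            (smooth + sum(resp[i][j] for i in range(n))) / (n + k * smooth)
            for j in range(k)
        ]
        thetas = []
        for j in range(k):
            counts = [
                smooth + sum(resp[i][j] * X[i][v] for i in range(n))
                for v in range(d)
            ]
            total = sum(counts)
            thetas.append([c / total for c in counts])

        # E-step: posterior responsibilities, computed in log space.
        # The multinomial coefficient is identical across components,
        # so it cancels in the normalization and is omitted.
        for i in range(n):
            logs = [
                math.log(weights[j])
                + sum(c * math.log(t) for c, t in zip(X[i], thetas[j]) if c)
                for j in range(k)
            ]
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]
            total = sum(exps)
            resp[i] = [e / total for e in exps]
    return weights, thetas, resp


# Toy sparse counts: two groups of documents using disjoint vocabularies.
X = [[5, 4, 0, 0], [6, 3, 0, 0], [4, 5, 0, 0],
     [0, 0, 5, 4], [0, 0, 3, 6], [0, 0, 4, 5]]
weights, thetas, resp = em_multinomial_mixture(X, k=2)
labels = [max(range(2), key=lambda j: row[j]) for row in resp]
print(labels)
```

The additive smoothing keeps every probability strictly positive, so the logarithms stay finite on sparse vectors. A single random start like this one remains sensitive to initialization, which is the weakness the abstract says the thesis addresses.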

    Evaluating the Dynamics of Knowledge-Based Network Through Simulation: The Case of Canadian Nanotechnology Industry

    Collaboration is a major factor in knowledge and innovation creation in emerging science-driven industries where the technology is rapidly changing and constantly evolving, such as nanotechnology. The scientific collaborations among individuals and organizations form a knowledge co-creation network within which information is shared, innovative ideas are exchanged, and new knowledge is generated. Although various simulation attempts have been carried out recently to analyze the performance of such networks at the firm level, the individual level has not yet been much explored in the literature. The objective of this thesis is to investigate the role of individual scientists and their collaborations in enhancing knowledge flows, and consequently scientific production, among Canadian nanotechnology scientists. The methodology involves two main phases. First, in order to understand the collaborative behavior of scientists in the real world, data on all nanotechnology journal publications in Canada were extracted from the SCOPUS database, and the scientists' research performance and partnership history were analyzed using social network analysis. Moreover, the predominant properties that make a scientist sufficiently attractive to be selected as a research partner were determined using data mining and through a questionnaire sent directly to researchers selected from our database. In the second phase, an agent-based model was developed in NetLogo to simulate the knowledge-based network, in which several factors regarding the ratio, presence, and absence of various categories of scientists could be controlled. It was found that scientists in centralized positions in such a network have a considerable positive impact on knowledge flows, while loyalty and cliquishness negatively affect knowledge transmission.
Star scientists appear to play a substitutive role in the network, as the most famous and trusted partners to be selected when usual collaborators are scarce or missing. In addition, changes in the performance of some categories of scientists when others are absent were also observed. The major contribution of this work stems from the fact that the developed simulation model is the first one that is fully based on real data and on the observed behavior of scientists in a knowledge-based network.
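The contrast between centralized positions and cliquishness can be illustrated with a toy diffusion on a fixed network. This is only a deterministic sketch under strong assumptions, not the thesis's NetLogo agent-based model: the update rule (each agent closes a fixed fraction of the gap to its best-informed neighbour), the helper names `diffuse`, `star`, and `cliques`, and the six-agent networks are all hypothetical.

```python
def diffuse(adjacency, knowledge, rate=0.5, steps=10):
    """Synchronously spread knowledge over a fixed network.

    At each step, an agent moves `rate` of the way toward the
    knowledge level of its best-informed neighbour, if that
    neighbour knows more than the agent does.
    """
    current = list(knowledge)
    for _ in range(steps):
        updated = list(current)
        for agent, neighbours in adjacency.items():
            if not neighbours:
                continue
            best = max(current[n] for n in neighbours)
            if best > current[agent]:
                updated[agent] = current[agent] + rate * (best - current[agent])
        current = updated
    return current


def star(n):
    """Agent 0 is a central hub connected to every other agent."""
    adjacency = {0: list(range(1, n))}
    for i in range(1, n):
        adjacency[i] = [0]
    return adjacency


def cliques(n, size):
    """Agents form fully connected groups with no links between groups."""
    adjacency = {}
    for i in range(n):
        start = (i // size) * size
        stop = min(start + size, n)
        adjacency[i] = [j for j in range(start, stop) if j != i]
    return adjacency


# Agent 1 starts with all the knowledge; everyone else starts at zero.
initial = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
star_result = diffuse(star(6), initial)
clique_result = diffuse(cliques(6, 3), initial)
print(sum(star_result) / 6, sum(clique_result) / 6)
```

In this toy setting the hub relays knowledge to every agent, while knowledge seeded in one isolated clique never reaches the other. That agrees only qualitatively with the direction of the reported findings and does not reproduce them.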